# Lab 20 - Hierarchical Clustering

We will look at two datasets today as we study hierarchical clustering. 

## Clustering the Iris data

The first is the [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which contains 50 samples each of 3 types of irises (Iris setosa, Iris virginica and Iris versicolor). The 4 measurements for each iris are the length and width (in cm) of the [sepals](https://en.wikipedia.org/wiki/Sepal) and petals.

The iris dataset is included in the scikit-learn package, so we can load it from there.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.cluster.hierarchy as shc

from sklearn.preprocessing import MinMaxScaler

from sklearn.cluster import AgglomerativeClustering

from sklearn.metrics import confusion_matrix

from sklearn import datasets

%matplotlib inline
pd.set_option("display.max_columns", None)

As with the Boston housing dataset, the iris dataset loaded from scikit-learn comes in a dictionary-like format.

In [None]:
iris_dict = datasets.load_iris()
iris_dict

Type `iris_dict.keys()` to see what is included in the dataset.

Create a dataframe from the dictionary:

In [None]:
iris = pd.DataFrame(iris_dict.data, columns = iris_dict.feature_names)
iris.head()

Display your new dataset.

Since we will compare the rows using the Euclidean distance, columns with larger values would otherwise dominate the distance, so we should scale all columns to be between 0 and 1. Do this below, storing the scaled data in the variable `iris_scaled` (see Lab 19).
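One possible way to do the scaling is sketched below, assuming that the `MinMaxScaler` imported above is the intended tool. The cell reloads the data so that it runs on its own; note that `fit_transform()` returns a NumPy array rather than a dataframe.

```python
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

# Reload the data so this cell is self-contained
iris_dict = datasets.load_iris()
iris = pd.DataFrame(iris_dict.data, columns=iris_dict.feature_names)

# Rescale every column so its minimum is 0 and its maximum is 1
scaler = MinMaxScaler()
iris_scaled = scaler.fit_transform(iris)

# Check the range of each scaled column
print(iris_scaled.min(axis=0))
print(iris_scaled.max(axis=0))
```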

Next we will plot a dendrogram (tree) of the distances between the irises.

In [None]:
plt.figure(figsize=(10, 7)) 
plt.title("Dendrograms") 
dend = shc.dendrogram(shc.linkage(iris_scaled, method='average'))

We can only plot the tree with the scipy package, so we need to use its linkage (clustering) function to produce the tree in a compatible format.
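To see what that format looks like, here is a small sketch on five made-up 2-D points (not the iris data). `linkage()` returns an array with one row per merge, and `dendrogram()` accepts `no_plot=True` if you only want the leaf ordering without drawing the tree.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Five made-up points, for illustration only
points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9], [0.5, 0.5]])

# Each row of Z records one merge: the two cluster indices that were joined,
# the distance between them, and the number of points in the new cluster
Z = shc.linkage(points, method='average')
print(Z.shape)   # n - 1 merges for n points, 4 columns
print(Z)

# no_plot=True returns the tree layout as a dictionary without plotting
tree = shc.dendrogram(Z, no_plot=True)
print(tree['ivl'])   # leaf labels in left-to-right order
```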

For the rest of the analysis, the scikit-learn implementation is easier to use.

In [None]:
# Euclidean distance is the default; the `affinity` argument was renamed `metric` in newer scikit-learn releases
cluster = AgglomerativeClustering(n_clusters=3, linkage='average')
clusters = cluster.fit_predict(iris_scaled)
clusters

Add the clusters to the `iris` dataframe as a new column.

To see how our clustering did, let's create a scatter plot of two of the columns in iris (your choice of columns), colored by the cluster.
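One possible version of the last two steps is sketched below. The column name `cluster` and the choice of the two petal columns are assumptions made for illustration; any pair of columns works. The cell repeats the loading, scaling and clustering so that it runs on its own, and it uses plain matplotlib rather than seaborn.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import MinMaxScaler

# Load and scale the data
iris_dict = datasets.load_iris()
iris = pd.DataFrame(iris_dict.data, columns=iris_dict.feature_names)
iris_scaled = MinMaxScaler().fit_transform(iris)

# Cluster, then store the labels in a new column (the name is our choice)
model = AgglomerativeClustering(n_clusters=3, linkage="average")
iris["cluster"] = model.fit_predict(iris_scaled)

# Scatter plot of two columns, colored by cluster label
x_col, y_col = "petal length (cm)", "petal width (cm)"
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(iris[x_col], iris[y_col], c=iris["cluster"], cmap="viridis")
ax.set_xlabel(x_col)
ax.set_ylabel(y_col)
ax.set_title("Iris measurements colored by cluster")
```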

How well do you think the clustering algorithm worked from this graph?

Let's compare it with the same plot, colored by the true type of iris.

First, add a new column to the `iris` dataframe with the `target` data in the dictionary.
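A minimal sketch of this step follows. The extra `species` column is not required by the lab; it is an optional addition that uses the `target_names` entry of the dictionary to turn the numeric codes into readable names.

```python
import pandas as pd
from sklearn import datasets

# Reload the data so this cell is self-contained
iris_dict = datasets.load_iris()
iris = pd.DataFrame(iris_dict.data, columns=iris_dict.feature_names)

# Numeric code (0, 1 or 2) for the true type of each iris
iris["target"] = iris_dict.target

# Optional: look up the name that goes with each code
iris["species"] = iris_dict.target_names[iris_dict.target]

print(iris[["target", "species"]].drop_duplicates())
```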

Next, plot the same two variables as above, colored by the true type of iris.

How do your two plots compare? What happens if you try a different pair of variables?

We can also use a confusion matrix to compare the predicted clusters with the real ones. Try it below.

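What is the accuracy of this clustering method? Keep in mind that the cluster numbers are assigned arbitrarily, so cluster 0 does not necessarily correspond to iris type 0.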
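Because cluster numbers are arbitrary, the diagonal of the confusion matrix can understate how well the clusters match the true types. The sketch below uses made-up labels (not the iris results) and pairs each true type with a cluster using `linear_sum_assignment` from `scipy.optimize`. That function is not part of this lab; it is one possible way to do the matching, and reading the pairing off the matrix by eye works just as well for three clusters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: the grouping is mostly right,
# but the cluster numbers do not line up with the true numbers
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_labels = np.array([1, 1, 1, 2, 2, 0, 0, 0, 0])

# Rows are true types, columns are predicted clusters
cm = confusion_matrix(true_labels, cluster_labels)
print(cm)

# Naive accuracy reads only the diagonal
naive_accuracy = np.trace(cm) / cm.sum()

# Pair each row with a distinct column so the matched counts are as large
# as possible (the function minimizes, hence the minus sign)
row_ind, col_ind = linear_sum_assignment(-cm)
matched_accuracy = cm[row_ind, col_ind].sum() / cm.sum()

print(naive_accuracy, matched_accuracy)
```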

What happens if you try Ward linkage instead?

## Clustering labor market data

The Federal Reserve Bank of New York has information about the labor market for recent college graduates [here](https://www.newyorkfed.org/research/college-labor-market/college-labor-market_compare-majors.html).

The data in this table can be downloaded as an Excel file at the bottom of the page. If you open this file in Excel, you can save the last table as a CSV file. Alternatively, download the data as a CSV file from the course website.

Open the CSV file in Jupyter or a text editor to see whether there are extra lines at the top or bottom that need to be skipped when reading it in. Recall that `read_csv()` has the optional parameters `skiprows` and `skipfooter`. Using `skipfooter` generates a warning unless you also pass `engine="python"`; the warning can be ignored.
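The sketch below shows how `skiprows` and `skipfooter` behave on a tiny in-memory file. Everything in it is made up for illustration: the title and footer lines, the `Unemployment Rate` column and all of the values. The real file will need different numbers of skipped lines, so check it first.

```python
import io
import pandas as pd

# A made-up file with one extra line at the top and one at the bottom
raw = "\n".join([
    "Labor market table (made-up title line)",
    "Major,Unemployment Rate,Median Wage Early Career",
    'Accounting,2.0,"50,000"',
    'Art History,5.0,"40,000"',
    "Source: made-up footer line",
])

labor_demo = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,         # skip the title line above the header
    skipfooter=1,       # skip the footer line
    engine="python",    # skipfooter is only supported by the python engine
    index_col="Major",  # use the Major column as the index
)
print(labor_demo)
print(labor_demo.dtypes)
```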

Additionally, set the index to be the `Major` column, and store the resulting dataframe in the variable `labor`.

Recall that we can look at the types of the columns using `df.dtypes`.

In [None]:
labor.dtypes

Which two columns are not numerical types (integers or floats)? Can you guess why?

The following code removes the commas and converts the type to float.

In [None]:
labor["Median Wage Early Career"] = labor["Median Wage Early Career"].str.replace(",","").astype(float)
labor["Median Wage Mid-Career"] = labor["Median Wage Mid-Career"].str.replace(",","").astype(float)

Check that the columns all have a numerical type.

Scale the data in each column to be between 0 and 1, storing the result in the variable `labor_scaled`.

We can put the scaled data back into a dataframe:

In [None]:
labor_scaled = pd.DataFrame(labor_scaled, columns = labor.columns, index = labor.index)
labor_scaled

Plot the dendrogram using Ward linkage. To label the leaves as the majors, add the parameter `labels = labor_scaled.index` to the `dendrogram()` function. 

You might also want to increase the leaf font size using the optional parameter `leaf_font_size`.
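Here is a minimal sketch of a labeled Ward dendrogram on a made-up, already-scaled dataframe with four hypothetical majors; it is not the labor data. The index is converted to a list before being passed as `labels`, which avoids depending on how scipy treats a pandas index object.

```python
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

# Made-up scaled data with hypothetical major names as the index
demo_scaled = pd.DataFrame(
    {"x": [0.0, 0.1, 0.9, 1.0], "y": [0.0, 0.2, 0.8, 1.0]},
    index=["Major A", "Major B", "Major C", "Major D"],
)

fig, ax = plt.subplots(figsize=(10, 7))
ax.set_title("Dendrogram (Ward linkage)")
dend = shc.dendrogram(
    shc.linkage(demo_scaled.to_numpy(), method="ward"),
    labels=demo_scaled.index.to_list(),  # label each leaf with its index entry
    leaf_font_size=12,                   # make the leaf labels larger
    ax=ax,
)
```

With many long labels, the optional parameter `orientation="right"` draws the tree sideways, which can make the leaves easier to read.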

What do you notice about the tree? Do the relationships of the leaves make sense?

Let's compute the clusters using scikit-learn.

Add the predicted clusters as a column to the `labor` dataframe.

Check the clusters by using a filter to display only the rows in a given cluster. Do the clusters make sense?
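The last three steps might look like the sketch below, again on a made-up dataframe with hypothetical majors rather than the labor data. The number of clusters (2 here) and the column name `cluster` are assumptions; for the real data, pick the number of clusters by looking at the dendrogram.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Made-up scaled data with hypothetical major names as the index
demo = pd.DataFrame(
    {"a": [0.0, 0.1, 0.9, 1.0], "b": [0.0, 0.2, 0.8, 1.0]},
    index=["Major A", "Major B", "Major C", "Major D"],
)

# Ward linkage, matching the dendrogram; n_clusters is our choice
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
demo["cluster"] = model.fit_predict(demo[["a", "b"]])

# Filter: keep only the rows that share a cluster with Major A
same_as_a = demo[demo["cluster"] == demo.loc["Major A", "cluster"]]
print(same_as_a)
```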